Thanks to my Data Science Professor Hector Corrada Bravo and his notes, which helped me immensely with this project.
Also, please forgive my creative spelling. I grew up with spellcheck, and RStudio is running slow enough already without an extra plugin.
A popular source of argument among friends, enemies, frenemies, and family members (who could fall into any of the previous categories) is the state of gun ownership and gun violence in the United States today. I know I have (cordially, of course) discussed this issue a fair bit, especially after the Vegas and Parkland shootings. A common thread of disagreement in these conversations is the issue of regulation. Some people of a particular political leaning argue that more regulation leads to fewer guns, which leads to less gun violence. Others with the opposite political inclination argue that regulations would only be followed by people who wouldn't be committing gun crimes anyway, and would just give wrongdoers more incentive to do wrong without the fear of an armed bystander.

However, rather than presenting unbiased information to the public, news sources have decided to join one side of the argument or the other and sensationalize the issue to make more money. On the regulation-is-good side of the argument (henceforth referred to as pro-regulation), there is this article and this more scholarly article from Stanford. On the other, regulation-is-bad side (henceforth the anti-regulation stance) is this article.

Because, as mentioned before, the news is sensationalist and we can't believe anything it says, we will do the data analysis ourselves. We will get our information on gun crime from this Kaggle data set and cross-reference it with this data set on gun laws. We will play with weights based on issues that matter to us using this supporting table. Finally, we will adjust our gun crime rates for population using a dataset from the Census Bureau. After we store this data in R and make it pretty and clean, we will visualize it with multiple charts, perform some regression work, and even use machine learning to predict future gun violence rates.
Of course, I don't know exactly where this data came from; it could come from people who are as biased as the news sources. We must also understand that there are several factors at work in an issue like this one, and just because a correlation is found doesn't mean there is causation.
In terms of hypothesis testing, our null hypothesis is that there is no relationship between gun violence and gun regulation, and our alternative hypothesis is that there is one. More general information on hypothesis testing can be found here.
As a final thought, I wrote the preceding without doing any data analysis, so I will be just as surprised by the results of these procedures when I finish running them as you will be when you read this for the first time!
We will be doing this project in the R language using the RStudio IDE. You can download RStudio here. If you want more background on R and RStudio, visit this page, which has links to various resources for familiarizing yourself with R. I will try to be very thorough here as well, however.
Once you have R installed and have familiarized yourself with it to the extent you desire, sign up for a Kaggle account (I used my burner Google account and it was quite easy) and download the Kaggle data set and the data set on gun laws as CSV files. The Kaggle data will download as a zipped folder with a CSV in it; for the gun laws data you have to choose CSV from a list of options. The gun laws page defaults to 2017 only, so use the tools on the left to select the years 2013-2017, then click the 'CSV' button. Click on this link to download the census dataset. You can open these files in Microsoft Excel or another spreadsheet program to inspect the data, or look at the raw CSVs in a text editor to get a feel for what R will actually be seeing.
Once you have RStudio open, go to File->New Project and create a directory for this project. This is what I will call the project's home directory, and all files associated with the project should be put in this folder.
We will load in the CSV for the Kaggle gun violence data first. If you haven’t already put the CSV in the project’s home directory, do so.
#This library has everything and is fantastic, I've used it for
#every R project I've done
library(tidyverse)
#Pulls the data from our CSV and stores it as R's
#data structure, a data frame
violence_full <- read.csv("gun-violence-data_01-2013_03-2018.csv")
#this pipeline selects the first five columns and the
#first six rows of the dataset and prints them
violence_full %>% select(1:5) %>% head()
## incident_id date state city_or_county
## 1 461105 2013-01-01 Pennsylvania Mckeesport
## 2 460726 2013-01-01 California Hawthorne
## 3 478855 2013-01-01 Ohio Lorain
## 4 478925 2013-01-05 Colorado Aurora
## 5 478959 2013-01-07 North Carolina Greensboro
## 6 478948 2013-01-07 Oklahoma Tulsa
## address
## 1 1506 Versailles Avenue and Coursin Street
## 2 13500 block of Cerise Avenue
## 3 1776 East 28th Street
## 4 16000 block of East Ithaca Place
## 5 307 Mourning Dove Terrace
## 6 6000 block of South Owasso
As a note, all code comments will be a max of two lines; if I want to expound on my notes, I will do it after the code block.
For example, the last line is what is called a pipeline. The '%>%' operator takes the result of the expression on its left and passes it as the first argument of the function on its right. These pipes usually start with a dataset, in this case violence_full. select picks columns, and the range 1:5 means columns 1 through 5 inclusive. head selects the first few rows (six by default) of a dataset. If a line produces a value without storing it, that value is printed.
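As a quick illustration on a toy vector (not our data), a pipeline is just a chain of nested function calls written left to right:

```r
# %>% comes from the magrittr package, which tidyverse loads for us
library(magrittr)

x <- c(4, 1, 3, 2)
# these two lines are equivalent:
x %>% sort() %>% head(2)   # pipeline style
head(sort(x), 2)           # nested style; both print 1 2
```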
Also, if you have not already installed a library, go to the packages tab (by default in the bottom right pane in RStudio), click the “install” button, type in the name of the package you are looking for, and click “Install.”
Now we will add the data set on gun laws and the census dataset. The government wasn’t kind enough to provide us with a CSV, so we will have to do a slightly different procedure for the .xlsx file they do provide.
#Pulls the data from our CSV and stores it as R's
#data structure, a data frame
gunlaws_full <- read.csv("raw_data.csv")
#this pipeline selects the first five columns and the
#first six rows of the dataset and prints them
gunlaws_full %>% select(1:5) %>% head()
## state year age18longgunpossess age18longgunsale age21handgunpossess
## 1 Alabama 2013 0 0 0
## 2 Alaska 2013 0 1 0
## 3 Arizona 2013 1 0 0
## 4 Arkansas 2013 0 0 0
## 5 California 2013 0 1 0
## 6 Colorado 2013 0 0 0
#library for reading .xlsx
library(xlsx)
#the second argument in read.xlsx is the sheet index
#it can also be the sheet name
pops_full <- read.xlsx("nst-est2017-01.xlsx", 1)
pops_full %>% select(1:5) %>% head()
## table.with.row.headers.in.column.A.and.column.headers.in.rows.3.through.4...leading.dots.indicate.sub.parts.
## 1 Table 1. Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2017
## 2 Geographic Area
## 3 <NA>
## 4 United States
## 5 Northeast
## 6 Midwest
## NA. NA..1 NA..2 NA..3
## 1 <NA> <NA> <NA> NA
## 2 40269 <NA> Population Estimate (as of July 1) NA
## 3 Census Estimates Base 2010 2011
## 4 308745538 308758105 309338421 311644280
## 5 55317240 55318350 55388349 55642659
## 6 66927001 66929794 66973360 67141501
You will note that the dataframe for pops_full has a lot of junk in it. If you open it and the CSV for gun laws in Excel and compare them, you will notice that the .xlsx contains a lot more data than the gun laws CSV, and that most of this data will not be needed. We'll deal with that in a bit.
If you want more information about reading in weird datafiles, check out this link.
If you look at the website for the categories, you'll notice that there is no option to download the data. We will have to scrape it. As outlined in that link, we have to tell R what sort of element to look for to collect the data. If you don't want to use the Chrome extension the link talks about, you will have to inspect the element and find the CSS class it belongs to; that class goes in the argument of html_nodes. This returns a list of matches, and you have to select the one you want. In my case, I wanted the first match it found.
#the library for scraping
library(rvest)
#read_html reads in a webpage; its argument is usually
#a web address, but here it is a locally saved copy
webpage <- read_html("State Firearm Laws - Use the Database Glossary Search Tool.html")
cats_full <- webpage %>%
html_nodes(".table") %>%
.[[1]] %>%
html_table(fill = TRUE)
head(cats_full)
## Code
## 1 age18longgunpossess
## 2 age18longgunsale
## 3 age21handgunpossess
## 4 age21handgunsale
## 5 age21longgunpossess
## 6 age21longgunsale
## Definition
## 1 No possession of long guns until age 18
## 2 Purchase of long guns from licensed dealers and private sellers restricted to age 18 and older
## 3 No possession of handguns until age 21
## 4 Purchase of handguns from licensed dealers and private sellers restricted to age 21 and older
## 5 No possession of long guns until age 21
## 6 Purchase of long guns from licensed dealers and private sellers restricted to age 21 and older
## Category/Subcategory
## 1 Possession regulationsAge restrictions
## 2 Buyer regulationsAge restrictions
## 3 Possession regulationsAge restrictions
## 4 Buyer regulationsAge restrictions
## 5 Possession regulationsAge restrictions
## 6 Buyer regulationsAge restrictions
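The odd-looking `.[[1]]` step in that pipeline deserves a word: in a magrittr pipeline, the dot names the value flowing through, so `.[[1]]` grabs the first element of the list of matches. A toy example on a plain list (not the actual node set rvest returns):

```r
# %>% and the . placeholder come from magrittr (loaded by tidyverse)
library(magrittr)

# . stands for the incoming value, so .[[1]] extracts element 1
list("first match", "second match") %>% .[[1]]   # prints "first match"
```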
One snag I ran into here is that the page is dynamically loaded, so scraping the live URL didn't work; it only worked when I right-clicked on the page, selected "Save page as…", and put the result in my project's home directory.
As mentioned before, there is a lot of junk in the pops_full dataset. We only want the data and none of the fluff. I'll remind you what this dataset looks like:
head(pops_full)
## table.with.row.headers.in.column.A.and.column.headers.in.rows.3.through.4...leading.dots.indicate.sub.parts.
## 1 Table 1. Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2017
## 2 Geographic Area
## 3 <NA>
## 4 United States
## 5 Northeast
## 6 Midwest
## NA. NA..1 NA..2 NA..3
## 1 <NA> <NA> <NA> NA
## 2 40269 <NA> Population Estimate (as of July 1) NA
## 3 Census Estimates Base 2010 2011
## 4 308745538 308758105 309338421 311644280
## 5 55317240 55318350 55388349 55642659
## 6 66927001 66929794 66973360 67141501
## NA..4 NA..5 NA..6 NA..7 NA..8 NA..9
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 2012 2013 2014 2015 2016 2017
## 4 313993272 316234505 318622525 321039839 323405935 325719178
## 5 55860261 56047732 56203078 56296628 56359360 56470581
## 6 67318295 67534451 67720120 67839187 67978168 68179351
You'll notice that we only want a few of the columns (called attributes): the ones with the state names and the dates. You will also notice that there are some extra rows (called entities). The following code trims them down.
#library for text manipulation
library(stringr)
#the arrow is the assignment operator
pops_trimmed <- pops_full %>%
#removes attributes 2 and 3
select(-(2:3)) %>%
#removes first 8 entities
slice(-(1:8)) %>%
#removes the last 7 entities
slice(-(52:58)) %>%
#change the column names (temporarily for ease of use)
#the ticks let us use code we wouldn't usually put in a pipeline
`colnames<-`(c("State", "2010", "2011", "2012", "2013",
"2014", "2015", "2016", "2017")) %>%
#remove the first character in the string for State
mutate(State = str_sub(State, 2, 25)) %>%
#change 2010 (backticks because the name is a number) to numeric
transform(`2010` = as.double(as.character(`2010`))) %>%
#change the column names (again...)
#try without this to see what happens.
`colnames<-`(c("State", "2010", "2011", "2012", "2013",
"2014", "2015", "2016", "2017"))
head(pops_trimmed)
## State 2010 2011 2012 2013 2014 2015
## 1 Alabama 4785579 4798649 4813946 4827660 4840037 4850858
## 2 Alaska 714015 722259 730825 736760 736759 737979
## 3 Arizona 6407002 6465488 6544211 6616124 6706435 6802262
## 4 Arkansas 2921737 2938640 2949208 2956780 2964800 2975626
## 5 California 37327690 37672654 38019006 38347383 38701278 39032444
## 6 Colorado 5048029 5116411 5186330 5262556 5342311 5440445
## 2016 2017
## 1 4860545 4874747
## 2 741522 739795
## 3 6908642 7016270
## 4 2988231 3004279
## 5 39296476 39536653
## 6 5530105 5607154
You will notice that I changed the state names to remove the leading periods, changed the attribute names, and changed the 2010 attribute to numeric (helpful later, when we want to treat the values as numbers: it was treating them as text). For more information on string manipulation, check this link.
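For the curious, here is how the leading-period removal works on its own, using a toy string rather than our data:

```r
library(stringr)

# str_sub(string, start, end) keeps characters start through end;
# starting at 2 drops the leading period, and an end past the
# string's length is simply truncated
str_sub(".Alabama", 2, 25)   # prints "Alabama"
```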
However, we're not done. We want to be able to join the violence dataset to this one based on year and state eventually. This dataset as it stands is not very conducive to that; it would be better if there were a separate entity for each state in each year. This also plays into the idea of tidy data (incidentally, this seems to be the paper Professor Bravo used for his class notes, so I should probably cite it).
#gather(dataset, what attribute names turns into, what the data turns into,
# what not to include)
pops_tidy <- gather(pops_trimmed, year, population, -State)
head(pops_tidy)
## State year population
## 1 Alabama 2010 4785579
## 2 Alaska 2010 714015
## 3 Arizona 2010 6407002
## 4 Arkansas 2010 2921737
## 5 California 2010 37327690
## 6 Colorado 2010 5048029
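To make gather's behavior concrete, here it is on a tiny hand-made frame (toy data, not ours): each year column is stacked into rows of (State, year, population).

```r
library(tidyr)

# check.names = FALSE keeps the numeric column names as-is
wide <- data.frame(State = c("Alabama", "Alaska"),
                   `2010` = c(4785579, 714015),
                   `2011` = c(4798649, 722259),
                   check.names = FALSE)
# the year column names become values in 'year',
# and their cells become values in 'population'
gather(wide, year, population, -State)
```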
Also, the violence_full dataset has more attributes than we need, and the date attribute is not seen by R as a date. To use it effectively as a date, we would have to convert the "date" attribute to a true date type. This involves some fanciness, and we really don't need the full date for this project, so we won't bother. Instead, we will keep just the year for use when joining with the population data for analysis.
violence_trimmed <- violence_full %>%
#select the 4 attributes we want
select(date, state, longitude, latitude) %>%
#keep just the year: the first 4 characters of the date string
mutate(date = str_sub(date, 1, 4))
head(violence_trimmed)
## date state longitude latitude
## 1 2013 Pennsylvania -79.8559 40.3467
## 2 2013 California -118.3330 33.9090
## 3 2013 Ohio -82.1377 41.4455
## 4 2013 Colorado -104.8020 39.6518
## 5 2013 North Carolina -79.9569 36.1140
## 6 2013 Oklahoma -95.9768 36.2405
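As an aside, the "fanciness" mentioned above is actually not too bad in base R, since the dates are already in ISO yyyy-mm-dd format; this is just for reference, not something we use below:

```r
# as.Date parses ISO-formatted strings directly into Date objects
d <- as.Date("2013-01-01")
# format() can then pull out just the year
format(d, "%Y")   # prints "2013"
```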
This dataset is really cool and I regret chopping out this much data. I am sure someone particularly motivated could do some really cool stuff with all the data this thing has. If you haven’t looked more closely at it yet, do so!
Finally, you’ll notice that the cats_full dataset has two attributes in one column. We need to split this.
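Before running the split, it helps to see what the regular expression '(.+)([A-Z].+)' does: the greedy first group eats everything it can, leaving the second group to start at the last capital letter. A base-R demonstration on one of the mashed-together values:

```r
s <- "Possession regulationsAge restrictions"
# regexec finds the match positions including capture groups;
# regmatches pulls out the matched text
m <- regmatches(s, regexec("(.+)([A-Z].+)", s))[[1]]
m[2:3]   # the two capture groups: category, then subcategory
```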
#extract splits one column into two using regex capture groups;
#store the result back into cats_full, then print to check
cats_full <- extract(cats_full, 'Category/Subcategory',
                     into = c("category", "subcategory"), '(.+)([A-Z].+)')
cats_full
## Code
## 1 age18longgunpossess
## 2 age18longgunsale
## 3 age21handgunpossess
## 4 age21handgunsale
## 5 age21longgunpossess
## 6 age21longgunsale
## 7 age21longgunsaled
## 8 alcoholism
## 9 alctreatment
## 10 amm18
## 11 amm21h
## 12 ammbackground
## 13 ammlicense
## 14 ammpermit
## 15 ammrecords
## 16 ammrestrict
## 17 assault
## 18 assaultlist
## 19 assaultregister
## 20 assaulttransfer
## 21 backgroundpurge
## 22 cap14
## 23 cap16
## 24 cap18
## 25 capaccess
## 26 capliability
## 27 capunloaded
## 28 capuses
## 29 ccbackground
## 30 ccbackgroundnics
## 31 ccrenewbackground
## 32 ccrevoke
## 33 college
## 34 collegeconcealed
## 35 danger
## 36 dealer
## 37 dealerh
## 38 defactoreg
## 39 defactoregh
## 40 drugmisdemeanor
## 41 dvro
## 42 dvrodating
## 43 dvroremoval
## 44 dvrosurrender
## 45 dvrosurrenderdating
## 46 dvrosurrendernoconditions
## 47 elementary
## 48 exparte
## 49 expartedating
## 50 expartesurrender
## 51 expartesurrenderdating
## 52 expartesurrendernoconditions
## 53 felony
## 54 fingerprint
## 55 gunshow
## 56 gunshowh
## 57 gvro
## 58 gvrolawenforcement
## 59 immunity
## 60 incidentall
## 61 incidentremoval
## 62 inspection
## 63 invcommitment
## 64 invoutpatient
## 65 junkgun
## 66 liability
## 67 lockd
## 68 locked
## 69 lockp
## 70 lockstandards
## 71 loststolen
## 72 magazine
## 73 magazinepreowned
## 74 mayissue
## 75 mcdv
## 76 mcdvdating
## 77 mcdvremovalallowed
## 78 mcdvremovalrequired
## 79 mcdvsurrender
## 80 mcdvsurrenderdating
## 81 mcdvsurrendernoconditions
## 82 mentalhealth
## 83 microstamp
## 84 nosyg
## 85 onefeature
## 86 onepermonth
## 87 opencarryh
## 88 opencarryl
## 89 opencarrypermith
## 90 opencarrypermitl
## 91 permit
## 92 permitconcealed
## 93 permith
## 94 permitlaw
## 95 personalized
## 96 preemption
## 97 preemptionbroad
## 98 preemptionnarrow
## 99 purge
## 100 recordsall
## 101 recordsallh
## 102 recordsdealer
## 103 recordsdealerh
## 104 registration
## 105 registrationh
## 106 reportall
## 107 reportallh
## 108 reportdealer
## 109 reportdealerh
## 110 residential
## 111 security
## 112 showing
## 113 stalking
## 114 statechecks
## 115 statechecksh
## 116 strawpurchase
## 117 strawpurchaseh
## 118 tenroundlimit
## 119 theft
## 120 threedaylimit
## 121 traffickingbackground
## 122 traffickingprohibited
## 123 traffickingprohibitedh
## 124 training
## 125 universal
## 126 universalh
## 127 universalpermit
## 128 universalpermith
## 129 violent
## 130 violenth
## 131 violentpartial
## 132 waiting
## 133 waitingh
## Definition
## 1 No possession of long guns until age 18
## 2 Purchase of long guns from licensed dealers and private sellers restricted to age 18 and older
## 3 No possession of handguns until age 21
## 4 Purchase of handguns from licensed dealers and private sellers restricted to age 21 and older
## 5 No possession of long guns until age 21
## 6 Purchase of long guns from licensed dealers and private sellers restricted to age 21 and older
## 7 Purchase of long guns from licensed dealers restricted to age 21 and older
## 8 Firearm possession is prohibited for some people with alcoholism
## 9 Firearm possession is prohibited for some people with alcohol-related problems
## 10 Purchase of any type of ammunition restricted to age 18 and older
## 11 Purchase of handgun ammunition restricted to age 21 and older
## 12 Background checks required for ammunition purchase
## 13 Vendor license required to sell ammunition
## 14 Permit required to purchase ammunition
## 15 Records of ammunition sales must be retained
## 16 All of the state’s high-risk gun possession prohibitions also apply to ammunition possession
## 17 Ban on sale of assault weapons beyond just assault pistols
## 18 Ban on sale of assault weapons which includes a list of banned weapons
## 19 Grandfathered weapons must be registered
## 20 Transfer of grandfathered weapons is prohibited
## 21 State can retain background check records for at least 60 days
## 22 Criminal liability for negligent storage applies to access by children less than 14 years old
## 23 Criminal liability for negligent storage applies to access by children less than 16 years old
## 24 Criminal liability for negligent storage applies to access by children less than 18 years old
## 25 Criminal liability for negligent storage of guns if child gains access
## 26 Criminal liability for negligent storage of guns, regardless of whether child gains access
## 27 Criminal liability for negligent storage applies regardless of whether gun is loaded or unloaded
## 28 Criminal liability for negligent storage of guns if child uses or carries the gun
## 29 Concealed carry permit process requires a background check
## 30 Background check process for a concealed carry permit explicitly requires a check of the NICS database
## 31 Concealed carry permit renewal requires a new background check
## 32 Authorities are required to revoke concealed carry permits under certain circumstances
## 33 No gun carrying allowed on college campuses except for concealed weapon permittees
## 34 No gun carrying on college campuses, including concealed weapons permittees
## 35 Firearm possession is prohibited if person is deemed by court to be a danger to oneself or others
## 36 State dealer license required for sale of all firearms
## 37 State dealer license required for sale of handguns
## 38 De facto registration of firearms is in place because of a recordkeeping requirement for all gun sales
## 39 De facto registration of handguns is in place because of a recordkeeping requirement for all handgun sales
## 40 Firearm possession is prohibited for people with a drug misdemeanor conviction
## 41 State law automatically prohibits domestic violence-related restraining order (DVRO) subjects from possessing firearms
## 42 DVROs are automatically prohibiting if the subject is a dating partner of the petitioner
## 43 Law enforcement officials are required to remove firearms from people subject to a domestic violence-related restraining order
## 44 State law requires DVRO subjects to surrender their firearms
## 45 The surrender provisions apply if the subject is a dating partner of the petitioner
## 46 No additional finding is required before the firearm surrender provisions apply
## 47 No gun carrying on elementary school property, including concealed weapons permittees
## 48 Ex parte (temporary) DVRO subjects are automatically prohibited from possessing firearms
## 49 Ex parte DVROs are prohibiting if the petitioner is a dating partner of the DVRO subject
## 50 State law requires ex parte DVRO subjects to surrender their firearms
## 51 The ex parte DVRO surrender provisions apply if the subject is a dating partner of the petitioner
## 52 No additional finding is required before the ex parte DVRO firearm surrender provisions apply
## 53 Firearm possession is prohibited for people with a felony conviction
## 54 Buyers must be fingerprinted at point of purchase
## 55 Background checks required for all gun show firearm sales at point of purchase
## 56 Background checks required for gun show handgun sales at point of purchase
## 57 Family members or law enforcement officers can confiscate firearms from any person who is deemed by a judge to represent a threat to themselves or others
## 58 Law enforcement officers can confiscate firearms from any person who is deemed by a judge to represent a threat to themselves or others
## 59 No law provides blanket immunity to gun manufacturers or prohibits state or local lawsuits against gun manufacturers
## 60 All firearms must be removed from the scene of a domestic violence incident
## 61 State law requires law enforcement to remove firearms from the scene of a domestic violence incident
## 62 Mandatory police inspections of dealers
## 63 Firearm possession is prohibited for people who have been involuntarily committed to an inpatient facility
## 64 Firearm possession is prohibited for people who have been involuntarily committed to an outpatient facility
## 65 Ban on junk guns (sometimes called "Saturday night specials")
## 66 Dealers are liable for damages resulting from illegal gun sales
## 67 Safety lock required for handguns sold through licensed dealers
## 68 All firearms in a household must be stored securely (locked away) at all times
## 69 Safety lock required for handguns sold through all dealers
## 70 Safety lock is required for handguns and must be approved by state standards
## 71 Mandatory reporting of lost and stolen guns by firearm owners
## 72 Ban on sale large capacity magazines beyond just ammunition for pistols
## 73 Possession of pre-owned large capacity magazines is prohibited
## 74 "May issue" state
## 75 People convicted of a misdemeanor crime of domestic violence against a spouse, ex-spouse, or cohabitating partner are prohibited from possessing firearms
## 76 All people convicted of a misdemeanor crime of domestic violence are prohibited from possessing firearms
## 77 State law allows law enforcement to remove firearms from MCDV offenders
## 78 State law requires law enforcement officers to remove firearms from MCDV offenders
## 79 People convicted of a misdemeanor crime of domestic violence against a spouse, ex-spouse, or cohabitating partner are required to surrender their firearms
## 80 The surrender provisions apply if the defendant is a dating partner of the victim
## 81 People convicted of a misdemeanor crime of domestic violence against a spouse, ex-spouse, or cohabitating partner are required to surrender their firearms with no exceptions
## 82 Required background checks include an explicit requirement for search of state mental health records
## 83 All handguns sold must have either ballistic fingerprinting or microstamping so that they can be identified if used in a crime
## 84 No stand your ground law
## 85 Ban on sale of assault weapons using a one-feature definition
## 86 Buyers can purchase no more than one handgun per month with no or limited exceptions
## 87 No open carry of handguns is allowed in public places
## 88 No open carry of long guns is allowed in public places
## 89 No open carry of handguns is allowed in public places unless the person has a concealed carry or handgun carry permit
## 90 No open carry of long guns is allowed in public places unless the person has a permit
## 91 A license or permit is required to purchase all firearms
## 92 Permit required to carry concealed weapons
## 93 A license or permit is required to purchase handguns
## 94 Permit process involves law enforcement
## 95 State has a law that requires review of personalized gun technology
## 96 State law does not preempt local regulation of firearms in any way
## 97 State law does not completely preempt local regulation of firearms
## 98 Any state law that preempts local regulation of firearms is narrow in its scope (i.e., in one area of regulation)
## 99 Dealers can retain sales records for at least 60 days after firearm purchase
## 100 All private sellers and licensed dealers are required to keep and retain records of all firearm sales
## 101 All private sellers and licensed dealers are required to keep and retain records of handgun sales
## 102 Licensed dealers are required to keep and retain records of all firearm sales
## 103 Licensed dealers are required to keep and retain records of handgun sales
## 104 Gun owners must register their firearms with the state
## 105 Gun owners must register their handguns with the state
## 106 All private sellers and licensed dealers are required to report all firearm sales records to the state
## 107 All private sellers and licensed dealers are required to report handgun sales records to the state
## 108 Licensed dealers are required to report all firearm sales records to the state
## 109 Licensed dealers are required to report handgun sales records to the state
## 110 Ban on non-commercial dealers
## 111 State requires at least one store security precaution for firearm dealers
## 112 Applicants are required to make a heightened showing to obtain a concealed carry permit
## 113 A stalking conviction is prohibitive for firearm possession
## 114 State conducts separate background checks, beyond the federal NICS check, for all firearms
## 115 State conducts separate background checks, beyond the federal NICS check, for handguns
## 116 No person may purchase a firearm on behalf of another person
## 117 No person may purchase a handgun on behalf of another person
## 118 No magazines with a capacity of more than 10 rounds of ammunition may be sold
## 119 Mandatory reporting of stolen guns by all firearm dealers
## 120 Background checks for gun sales or permits have more than a three day period in which they can be completed
## 121 No person may purchase a firearm with the intent to re-sell without the buyer going through a background check or having already gone through a background check
## 122 No person may purchase a firearm with the intent to re-sell to a person who is prohibited from buying or possessing a firearm
## 123 No person may purchase a handgun with the intent to re-sell to a person who is prohibited from buying or possessing a firearm
## 124 Safety training or testing required prior to issuing a firearm license or permit
## 125 Universal background checks required at point of purchase for all firearms
## 126 Universal background check required at point of purchase for handguns
## 127 Background checks conducted through permit requirement for all firearm sales (or universal background checks)
## 128 Background checks conducted through permit requirement for all handgun sales (or universal background checks)
## 129 Firearm possession is prohibited for people who have committed a violent misdemeanor punishable by less than one year of imprisonment
## 130 Handgun possession is prohibited for people who have committed a violent misdemeanor punishable by less than one year of imprisonment
## 131 Firearm possession is prohibited for people who have committed a violent misdemeanor punishable by more than one year of imprisonment
## 132 Waiting period is required on all firearm purchases from dealers
## 133 Waiting period is required on all handgun purchases from dealers
## category
## 1 Possession regulations
## 2 Buyer regulations
## 3 Possession regulations
## 4 Buyer regulations
## 5 Possession regulations
## 6 Buyer regulations
## 7 Buyer regulations
## 8 Prohibitions for high-risk gun possession
## 9 Prohibitions for high-risk gun possession
## 10 Ammunition regulations
## 11 Ammunition regulations
## 12 Ammunition regulations
## 13 Ammunition regulations
## 14 Ammunition regulations
## 15 Ammunition regulations
## 16 Ammunition regulations
## 17 Assault weapons and large-capacity magazines
## 18 Assault weapons and large-capacity magazines
## 19 Assault weapons and large-capacity magazines
## 20 Assault weapons and large-capacity magazines
## 21 Background checks
## 22 Child access prevention
## 23 Child access prevention
## 24 Child access prevention
## 25 Child access prevention
## 26 Child access prevention
## 27 Child access prevention
## 28 Child access prevention
## 29 Concealed carry permitting
## 30 Concealed carry permitting
## 31 Concealed carry permitting
## 32 Concealed carry permitting
## 33 Possession regulations
## 34 Possession regulations
## 35 Prohibitions for high-risk gun possessionMental
## 36 Dealer regulations
## 37 Dealer regulations
## 38 Buyer regulations
## 39 Buyer regulations
## 40 Prohibitions for high-risk gun possession
## 41 Domestic violence
## 42 Domestic violence
## 43 Domestic violence
## 44 Domestic violence
## 45 Domestic violence
## 46 Domestic violence
## 47 Possession regulations
## 48 Domestic violence
## 49 Domestic violence
## 50 Domestic violence
## 51 Domestic violence
## 52 Domestic violence
## 53 Prohibitions for high-risk gun possession
## 54 Buyer regulations
## 55 Background checks
## 56 Background checks
## 57 Possession regulations
## 58 Possession regulations
## 59 Immunity
## 60 Domestic violence
## 61 Domestic violence
## 62 Dealer regulations
## 63 Prohibitions for high-risk gun possessionMental
## 64 Prohibitions for high-risk gun possessionMental
## 65 Dealer regulations
## 66 Dealer regulations
## 67 Child access prevention
## 68 Child access prevention
## 69 Child access prevention
## 70 Child access prevention
## 71 Buyer regulations
## 72 Assault weapons and large-capacity magazines
## 73 Assault weapons and large-capacity magazines
## 74 Concealed carry permitting
## 75 Domestic violence
## 76 Domestic violence
## 77 Domestic violence
## 78 Domestic violence
## 79 Domestic violence
## 80 Domestic violence
## 81 Domestic violence
## 82 Background checks
## 83 Gun trafficking
## 84 Stand your ground
## 85 Assault weapons and large-capacity magazines
## 86 Buyer regulations
## 87 Possession regulations
## 88 Possession regulations
## 89 Possession regulations
## 90 Possession regulations
## 91 Buyer regulations
## 92 Concealed carry permitting
## 93 Buyer regulations
## 94 Buyer regulations
## 95 Gun trafficking
## 96 Preemption
## 97 Preemption
## 98 Preemption
## 99 Dealer regulations
## 100 Dealer regulations
## 101 Dealer regulations
## 102 Dealer regulations
## 103 Dealer regulations
## 104 Buyer regulations
## 105 Buyer regulations
## 106 Dealer regulations
## 107 Dealer regulations
## 108 Dealer regulations
## 109 Dealer regulations
## 110 Dealer regulations
## 111 Dealer regulations
## 112 Concealed carry permitting
## 113 Domestic violence
## 114 Background checks
## 115 Background checks
## 116 Gun trafficking
## 117 Gun trafficking
## 118 Assault weapons and large-capacity magazines
## 119 Dealer regulations
## 120 Background checks
## 121 Gun trafficking
## 122 Gun trafficking
## 123 Gun trafficking
## 124 Buyer regulations
## 125 Background checks
## 126 Background checks
## 127 Background checks
## 128 Background checks
## 129 Prohibitions for high-risk gun possessionViolent
## 130 Prohibitions for high-risk gun possessionViolent
## 131 Prohibitions for high-risk gun possessionViolent
## 132 Buyer regulations
## 133 Buyer regulations
## subcategory
## 1 Age restrictions
## 2 Age restrictions
## 3 Age restrictions
## 4 Age restrictions
## 5 Age restrictions
## 6 Age restrictions
## 7 Age restrictions
## 8 Alcohol
## 9 Alcohol
## 10 Age restrictions
## 11 Age restrictions
## 12 Background checks
## 13 Licensing
## 14 Permitting
## 15 Recordkeeping
## 16 Prohibitors
## 17 Assault weapons ban
## 18 Assault weapons ban
## 19 Assault weapons ban
## 20 Assault weapons ban
## 21 Background check records
## 22 Storage
## 23 Storage
## 24 Storage
## 25 Storage
## 26 Storage
## 27 Storage
## 28 Storage
## 29 Background checks
## 30 Background checks
## 31 Background checks
## 32 Permitting
## 33 Campus carry
## 34 Campus carry
## 35 Health
## 36 Licensing
## 37 Licensing
## 38 Registration
## 39 Registration
## 40 Drugs
## 41 Restraining order
## 42 Restraining order
## 43 Restraining order
## 44 Restraining order
## 45 Restraining order
## 46 Restraining order
## 47 School zones
## 48 Restraining order
## 49 Restraining order
## 50 Restraining order
## 51 Restraining order
## 52 Restraining order
## 53 Felony
## 54 Fingerprinting
## 55 Gun shows
## 56 Gun shows
## 57 Gun violence restraining orders
## 58 Gun violence restraining orders
## 59 Immunity
## 60 Firearm removal
## 61 Firearm removal
## 62 Inspections
## 63 Health
## 64 Health
## 65 Junk guns
## 66 Liability
## 67 Safety locks
## 68 Storage
## 69 Safety locks
## 70 Safety locks
## 71 Theft reporting
## 72 Large capacity magazine ban
## 73 Large capacity magazine ban
## 74 Permitting
## 75 Misdemeanor crimes
## 76 Misdemeanor crimes
## 77 Misdemeanor crimes
## 78 Misdemeanor crimes
## 79 Misdemeanor crimes
## 80 Misdemeanor crimes
## 81 Misdemeanor crimes
## 82 Background checks - mental health records
## 83 Crime gun identification
## 84 Stand your ground
## 85 Assault weapons ban
## 86 Bulk purchase limit
## 87 Open carry
## 88 Open carry
## 89 Open carry
## 90 Open carry
## 91 Permitting
## 92 Permitting
## 93 Permitting
## 94 Permitting
## 95 Personalized gun technology
## 96 Preemption
## 97 Preemption
## 98 Preemption
## 99 Reporting
## 100 Recordkeeping
## 101 Recordkeeping
## 102 Recordkeeping
## 103 Recordkeeping
## 104 Registration
## 105 Registration
## 106 Reporting
## 107 Reporting
## 108 Reporting
## 109 Reporting
## 110 Location
## 111 Security
## 112 Permitting
## 113 Stalking
## 114 Background checks - state records
## 115 Background checks - state records
## 116 Straw purchase
## 117 Straw purchase
## 118 Large capacity magazine ban
## 119 Theft reporting
## 120 Background checks time limit
## 121 Gun trafficking
## 122 Gun trafficking
## 123 Gun trafficking
## 124 Safety training
## 125 Universal background checks
## 126 Universal background checks
## 127 Background checks through permits
## 128 Background checks through permits
## 129 Misdemeanor
## 130 Misdemeanor
## 131 Misdemeanor
## 132 Waiting period
## 133 Waiting period
This uses a regular expression to separate the original column into two columns, splitting at the capital letter that begins the subcategory.
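To illustrate the idea (this is a sketch, not the exact code used above), tidyr's extract() can split a fused string at the last lowercase-to-uppercase boundary; the two sample strings are stand-ins taken from the fused rows in the output above:

```r
library(tibble)
library(tidyr)

#toy stand-ins for the fused category/subcategory strings above
fused <- tibble(raw = c("Prohibitions for high-risk gun possessionMental",
                        "Child access preventionStorage"))

#group 1 greedily ends at the last lowercase letter that is followed by a
#capital; group 2 is everything from that capital onward
fused %>%
  extract(raw, into = c("category", "subcategory"),
          regex = "^(.*[a-z])([A-Z].*)$")
```

This relies on the category always ending in a lowercase letter and the subcategory always starting with a capital, which holds for the rows shown above.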
Because we are taking so long to get to our final data, and because we now have a hyper cool dataset with latitude and longitude, we are going to make cool maps! Hooray!
Since this section is a bit of a tangent, I will not go into extreme detail here.
Here, we will make an ultra-cool layered map, with a layer per year of data, showing the location of the reports of gun violence.
#the cool library for making maps
library(leaflet)
#get rid of entities with NA values
violence_map <- na.omit(violence_trimmed)
#start map pipeline
markermap <- leaflet(violence_map) %>%
  addTiles() %>%
  #get longitude of one year
  addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2013)%>%select(longitude)))},
                   #same for latitude
                   ~{as.numeric(unlist(violence_map%>%filter(date==2013)%>%select(latitude)))},
                   radius = 2,
                   color = 'red',
                   #assign this layer an identifier
                   group="2013") %>%
  addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2014)%>%select(longitude)))},
                   ~{as.numeric(unlist(violence_map%>%filter(date==2014)%>%select(latitude)))},
                   radius = 2,
                   color = 'orange',
                   group="2014") %>%
  addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2015)%>%select(longitude)))},
                   ~{as.numeric(unlist(violence_map%>%filter(date==2015)%>%select(latitude)))},
                   radius = 2,
                   color = 'green',
                   group="2015") %>%
  addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2016)%>%select(longitude)))},
                   ~{as.numeric(unlist(violence_map%>%filter(date==2016)%>%select(latitude)))},
                   radius = 2,
                   color = 'blue',
                   group="2016") %>%
  addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2017)%>%select(longitude)))},
                   ~{as.numeric(unlist(violence_map%>%filter(date==2017)%>%select(latitude)))},
                   radius = 2,
                   color = 'purple',
                   group="2017") %>%
  addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2018)%>%select(longitude)))},
                   ~{as.numeric(unlist(violence_map%>%filter(date==2018)%>%select(latitude)))},
                   radius = 2,
                   color = 'black',
                   group="2018") %>%
  addLayersControl(
    #gather all the layers together.
    overlayGroups = c("2013", "2014", "2015", "2016", "2017", "2018"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  #start with only the 2013 layer showing.
  hideGroup("2014") %>%
  hideGroup("2015") %>%
  hideGroup("2016") %>%
  hideGroup("2017") %>%
  hideGroup("2018")
#show it!
markermap
I know that this probably isn’t the most efficient way to write it, but it works…barely. This map is really slow. I recommend when manipulating it you turn off all the layers, turning them back on when the view is where you would like it to be. Usually one would add labels to the data, but it is slow enough as it is I’m not going to push it. Seriously, when trying to make this, I had to restart RStudio about 10 times as I made it less and less fancy (I started with awesome markers, labels, and the like). It just goes to show how big this dataset is.
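If I were to rewrite it, one way to cut the repetition would be to loop over the years and build the layers programmatically. This is an untested sketch of the same map, assuming the violence_map data frame from above:

```r
library(dplyr)
library(leaflet)

years  <- as.character(2013:2018)
colors <- c("red", "orange", "green", "blue", "purple", "black")

m <- leaflet() %>% addTiles()
#add one circle-marker layer per year
for (i in seq_along(years)) {
  yr_data <- violence_map %>% filter(date == years[i])
  m <- m %>% addCircleMarkers(lng = as.numeric(yr_data$longitude),
                              lat = as.numeric(yr_data$latitude),
                              radius = 2, color = colors[i], group = years[i])
}
m <- m %>% addLayersControl(overlayGroups = years,
                            options = layersControlOptions(collapsed = FALSE))
#start with only the 2013 layer showing
for (yr in years[-1]) m <- m %>% hideGroup(yr)
m
```

Adding a year then only means extending the two vectors at the top.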
The map for 2013 is relatively tame. There are a lot of dots, especially in urban areas, but fewer than I expected. However, when you turn on the filter for any other year, the map just explodes. It blew my mind how many firearm-related incidents were recorded in this dataset. Even 2018, which isn't over yet, has a ton of datapoints. It is cool to zoom to your hometown or an urban center, turn the filters on one by one, and see how much of this is going on. As I alluded to in the last paragraph, as flippant as I may be in the introduction paragraphs, this is a really serious issue that needs to be approached with great thought, but more importantly, action to go with it.
Relating to this project, it makes me wonder if a lot of data is missing for 2013. The Kaggle site reports that this data has “all recorded gun violence incidents in the US between January 2013 and March 2018, inclusive,” but the organization it got its data from only has maps since 2014, and this questionable site shows no dramatic increase between 2013 and 2014. Therefore, I will only use 2014-2017, the complete years, for analysis.
You can find more about leaflet here.
Here, we are going to finally answer the question of whether or not regulation has an impact on gun violence (in our somewhat flawed, generalized way). We are, unfortunately, going to have to modify some data sets some more. gunlaws_full already has a lawtotal column, so we don't have to do any more processing on that beyond slicing out all of the middle rows. Population is already mostly ready from the work above. It turns out, though, that we have to remove DC from population because the gun laws dataset doesn't include DC, and remove all years except 2014-2017 inclusive because of the violence dataset.
However, our incident database is not finished. It is a list of incidents; we want it to become a count of the incidents that happened in each state in each year. This website demonstrates how to do various aggregations and tallies. A very interesting, but seemingly still accurate, website has some good tips for how to aggregate data like this using an old library that has since been updated, but I like the name of the site so I left it in anyways.
#count separate occurrences of each pair of date and state
violence_summ <- violence_trimmed %>%
  count(date, state)
#select only the good data; unfortunately, we have to remove DC...
violence_summ <- violence_summ %>% filter(2014 <= date & date <= 2017 & state != "District of Columbia")
head(violence_summ)
## # A tibble: 6 x 3
## date state n
## <chr> <fct> <int>
## 1 2014 Alabama 1318
## 2 2014 Alaska 146
## 3 2014 Arizona 556
## 4 2014 Arkansas 572
## 5 2014 California 3732
## 6 2014 Colorado 556
I should have mentioned before what the c() function does! It takes its arguments and makes a vector out of them. You saw this before when I was renaming the columns in the population sheet and when grouping the layers for the map.
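For completeness, a couple of toy calls:

```r
c(1, 2, 3)         #a numeric vector
c("2013", "2014")  #a character vector, like the layer group names above
```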
Now that this data is in the right format, we must join the population and law counts to it. A join takes two datasets and combines rows that match on specified key columns. In this example, each dataset identifies an entity by a unique date and state. Because all of our datasets have rows that correspond exactly, it is not necessary to pick a specific type of join, but the types are discussed in detail here.
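To make the idea concrete, here is a toy left join on two keys, shaped like the real join below (the tiny tables are made up for illustration, using two rows of the real numbers):

```r
library(dplyr)

incidents <- tibble(state = c("Alabama", "Alaska"), date = c("2014", "2014"),
                    n = c(1318, 146))
pops <- tibble(State = c("Alabama", "Alaska"), year = c("2014", "2014"),
               population = c(4840037, 736759))

#match state to State and date to year, keeping every row of incidents
left_join(incidents, pops, by = c("state" = "State", "date" = "year"))
```

The by = c("left_key" = "right_key") syntax is how dplyr joins columns with different names in the two tables.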
First though, fix up the laws and the population datasets for the final time…
gunlaws_trimmed <- gunlaws_full %>%
  transform(year = as.character(year)) %>%
  select(state, year, lawtotal) %>%
  filter(2014 <= year & year <= 2017 & state != "District of Columbia")
pops_tidy_trimmed <- pops_tidy %>%
  filter(2014 <= year & year <= 2017 & State != "District of Columbia")
combined <- violence_summ %>%
  #join on two conditions
  left_join(gunlaws_trimmed, by = c("state" = "state", "date" = "year")) %>%
  #join again on two conditions
  left_join(pops_tidy_trimmed, by = c("state" = "State", "date" = "year")) %>%
  #get an incidents-per-person number
  mutate(per_pop = n/population)
head(combined)
## # A tibble: 6 x 6
## date state n lawtotal population per_pop
## <chr> <chr> <int> <int> <dbl> <dbl>
## 1 2014 Alabama 1318 10 4840037 0.000272
## 2 2014 Alaska 146 4 736759 0.000198
## 3 2014 Arizona 556 11 6706435 0.0000829
## 4 2014 Arkansas 572 11 2964800 0.000193
## 5 2014 California 3732 100 38701278 0.0000964
## 6 2014 Colorado 556 30 5342311 0.000104
Now that we finally have a good dataset, we need to make sure the preconditions for linear regression are met. Because some argue that increasing the number of laws decreases firearm incidents, we will call lawtotal the independent variable and the per-population incidence rate the dependent variable.
Let us look at our graph first (using ggplot: here is some mind-numbing information on it):
combined %>%
  #assign vars
  ggplot(aes(x=lawtotal, y=per_pop)) +
  #make labels
  labs(x = "Number of Regulation Laws", y = "Incidents per Capita", title = "Firearm Regulation vs. Firearm Incidents") +
  #make a dotplot
  geom_point()
This data seems like it might decrease slightly with the number of regulation laws, but the trend is not very clear cut. One way to try to make data more linear is to take the log of the dependent variable. We will try that.
combined %>%
  #assign vars
  ggplot(aes(x=lawtotal, y=log(per_pop))) +
  #make labels
  labs(x = "Number of Regulation Laws", y = "Log Incidents per Capita", title = "Firearm Regulation vs. Log Firearm Incidents") +
  #make a dotplot
  geom_point()
This has centered the data in the graph, but not made it much better. In fact, I would say that it has been made worse.
Reverting to the original, untransformed data, we will generate some more graphs to see whether these conditions are met. First, we have to fit a regression model and then feed it into the plot function.
#lm's formula is response[dependent]~terms[independent]
fit = lm(per_pop~lawtotal, data=combined)
#the which argument selects which diagnostic plots to show
plot(fit, which=1:2)
Also, we can use some functions from the broom library to get some stats on the regression.
library(broom)
glance(fit)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.02851464 0.02360814 8.895388e-05 5.811614 0.01683451 2 1582.696
## AIC BIC deviance df.residual
## 1 -3159.392 -3149.497 1.566733e-06 198
tidy(fit)
## term estimate std.error statistic p.value
## 1 (Intercept) 2.096411e-04 9.073062e-06 23.105876 4.082934e-58
## 2 lawtotal -5.844784e-07 2.424488e-07 -2.410729 1.683451e-02
After looking at this output and comparing it to the expectations of a good fit set out here, I would say that this is not a good regression line. The residuals in the “Residuals vs. Fitted” graph are all clustered to one side; with a good regression line, there would be no discernible pattern. The Normal Q-Q plot seems to show a curve: a good regression will produce normally distributed residuals, which would appear as a straight line on that plot. The \(R^2\) value is the proportion of the variance in the dependent variable that can be explained by the independent variable. In this case, it is about 3%, which is very low. A good regression would have an \(R^2\) value of about 70%. The p-value of the slope is reasonably small (about .017), but I am still wary due to the other factors mentioned.
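As a quick sanity check on broom's numbers, base R's summary() reports the same quantities directly from the fitted model:

```r
#should match the r.squared reported by glance(fit) above (~0.0285)
summary(fit)$r.squared
#the same coefficient table that tidy(fit) produces
coef(summary(fit))
```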
Even though this line is not very good, we can still plot it. The dark line is the regression itself, the light blue around it is a 95% confidence interval for the line.
combined %>%
  #assign vars
  ggplot(aes(x=lawtotal, y=per_pop)) +
  #make labels
  labs(x = "Number of Regulation Laws", y = "Incidents per Capita", title = "Firearm Regulation vs. Firearm Incidents") +
  #make a dotplot
  geom_point() +
  #add prediction
  stat_smooth(method="lm")
We see it is decreasing slightly.
However, due to the inconsistencies with the requirements for linear regression listed above, I do not believe that linear regression is the best choice. As far as regression goes, linear was about all that was discussed in class, so I am going to shrug my shoulders and give up on regression. I will, though, show you what R thinks is the best fit for this data:
combined %>%
  #assign vars
  ggplot(aes(x=lawtotal, y=per_pop)) +
  #make labels
  labs(x = "Number of Regulation Laws", y = "Incidents per Capita", title = "Firearm Regulation vs. Firearm Incidents") +
  #make a dotplot
  geom_point() +
  #add prediction
  geom_smooth()
See how this line is all wavy? That relationship is not very linear.
Again, since this is a bit of a tangent, I will not be as explicit about my process as I was before. Here, we will attempt to train a decision tree and a random forest to predict whether gun violence will go up or down. A decision tree branches based on the value of a parameter, with thresholds that are decided by iterating over training data. A random forest is multiple decision trees trained on random subsets of a dataset. To get a more accurate picture of each method's true performance, we will perform k-fold cross-validation. This will train multiple trees and RFs and average the results.
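The fold-creation step itself is simple; on a toy outcome vector, caret::createFolds returns a named list of test-set index vectors, one element per fold:

```r
library(caret)
set.seed(1)
#a made-up balanced outcome, just to show the shape of the result
toy_outcome <- factor(rep(c("up", "down"), 25))
folds <- createFolds(toy_outcome, k = 5)
#five roughly equal index vectors named Fold1..Fold5
lengths(folds)
```

Each fold's indices are held out as test data while the model trains on the rest, which is exactly how they are used below.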
First, we will take the combined dataset from above and try to manipulate it into a form more useful for k-fold cross-validation. For the reasons discussed above, we only have the years 2014-2017 to work with. This means 2017 will be our test data and we will only have the two intervals from 2014-2015 and 2015-2016 to train on. This will not be very effective, but it will at least show you the general process for doing your own example with better data. It will be interesting to see how relatively terribly the RF, with all of its extra work, performs compared to the decision tree.
#get the results for the 2016-2017 differences
#this will be used to test how well the predictor learned the other two gaps.
outcome_df <- combined %>%
  #pick the last gap.
  filter(date %in% c("2016", "2017")) %>%
  select(state, date, per_pop) %>%
  #create attributes for per_pop for each year
  spread(date, per_pop) %>%
  #find the year-over-year difference in per_pop
  mutate(diff = `2017` - `2016`) %>%
  #get a binary direction
  mutate(Direction = ifelse(diff>0, "up", "down")) %>%
  select(state, Direction)
head(outcome_df)
## # A tibble: 6 x 2
## state Direction
## <chr> <chr>
## 1 Alabama up
## 2 Alaska down
## 3 Arizona up
## 4 Arkansas up
## 5 California up
## 6 Colorado down
predictor_df <- combined %>%
  #get the dates up to and including 2016
  filter(date <= 2016) %>%
  select(date, state, per_pop) %>%
  spread(date, per_pop)
head(predictor_df)
## # A tibble: 6 x 4
## state `2014` `2015` `2016`
## <chr> <dbl> <dbl> <dbl>
## 1 Alabama 0.000272 0.000213 0.000269
## 2 Alaska 0.000198 0.000515 0.000600
## 3 Arizona 0.0000829 0.0000706 0.0000802
## 4 Arkansas 0.000193 0.000181 0.000241
## 5 California 0.0000964 0.0000829 0.0000920
## 6 Colorado 0.000104 0.000145 0.000147
#create offset matrix on left
matrix_1 <- predictor_df %>%
  select(-state) %>%
  as.matrix() %>%
  .[,-1]
#create offset matrix on right
matrix_2 <- predictor_df %>%
  select(-state) %>%
  as.matrix() %>%
  .[,-ncol(.)]
#get differences
diff_df <- (matrix_1 - matrix_2) %>%
  magrittr::set_colnames(NULL) %>%
  as_data_frame() %>%
  mutate(state = predictor_df$state)
#put the 2016-2017 outcomes back in
final_df <- diff_df %>%
  inner_join(outcome_df %>% select(state, Direction), by="state") %>%
  mutate(Direction=factor(Direction, levels=c("down", "up")))
head(final_df)
## # A tibble: 6 x 4
## V1 V2 state Direction
## <dbl> <dbl> <chr> <fct>
## 1 -0.0000592 0.0000562 Alabama up
## 2 0.000317 0.0000852 Alaska down
## 3 -0.0000123 0.00000962 Arizona up
## 4 -0.0000121 0.0000601 Arkansas up
## 5 -0.0000136 0.00000919 California up
## 6 0.0000408 0.00000199 Colorado down
(OK, yes, much of this is adapted from a homework. However, it still applies here.)
In the code above, two matrices are produced that are offset from each other (e.g., matrix_1 does not include the first year). This way, when one is subtracted from the other, a matrix of year-over-year differences in the incident rate is produced. This is then joined back with the states and the true directions from the 2016-2017 data, which the models' predictions will be tested against.
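The offset trick is easier to see on a tiny made-up matrix where the columns stand in for consecutive years:

```r
#rows = states, columns = years
m <- matrix(c(10, 20, 12, 25, 15, 24), nrow = 2)
#each column of the result is (year t+1) minus (year t)
m[, -1] - m[, -ncol(m)]
```

Dropping the first column of one copy and the last column of the other lines up each year with the year before it, so a plain subtraction yields all the differences at once.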
Now that we have a dataframe of differences, we need to run our k-fold cross validation on it. I have arbitrarily picked k=5.
#end digits of an old phone number
set.seed(4352)
#for the createFolds, tree, and RF functions, along with
#the predictors
library(randomForest)
library(caret)
library(ROCR)
library(tree)
# create the cross-validation partition, k = # of folds
result_df <- createFolds(final_df$Direction, k=5) %>%
  # fit models and gather results. This will run 5 times, one for each fold
  purrr::imap(function(test_indices, fold_number) {
    # split into train and test for the fold
    train_df <- final_df %>%
      select(-state) %>%
      slice(-test_indices)
    test_df <- final_df %>%
      select(-state) %>%
      slice(test_indices)
    # fit the two models
    rf <- randomForest(Direction~., data=train_df)
    tr <- tree(Direction~., data=train_df)
    # gather results
    test_df %>%
      select(observed_label = Direction) %>%
      mutate(fold=fold_number) %>%
      mutate(prob_positive_rf = predict(rf, newdata=test_df, type="prob")[,"up"]) %>%
      # add predicted labels for rf using a 0.52 probability cutoff
      mutate(predicted_label_rf = ifelse(prob_positive_rf > 0.52, "up", "down")) %>%
      #WHY CAN'T THEY STANDARDIZE THIS?
      mutate(prob_positive_tr = predict(tr, newdata=test_df)[,"up"]) %>%
      # add predicted labels for tr using a 0.5 probability cutoff
      mutate(predicted_label_tr = ifelse(prob_positive_tr > 0.5, "up", "down"))
  }) %>%
  # combine the five result data frames into one
  purrr::reduce(bind_rows)
head(result_df)
## # A tibble: 6 x 6
## observed_label fold prob_positive_rf predicted_label_~ prob_positive_tr
## <fct> <chr> <dbl> <chr> <dbl>
## 1 up Fold1 0.322 down 0.333
## 2 down Fold1 0.338 down 0
## 3 up Fold1 0.522 up 0
## 4 up Fold1 0.412 down 0.333
## 5 up Fold1 0.500 down 0.556
## 6 up Fold1 0.224 down 0.333
## # ... with 1 more variable: predicted_label_tr <chr>
Here we have calculated the predictions for both the tree and the RF, one for each in each of the 5 folds. Now we need to analyze this data somehow.
error_rates <- result_df %>%
  #calculate error
  mutate(error_rf = observed_label != predicted_label_rf,
         error_tr = observed_label != predicted_label_tr) %>%
  #group and take a mean (true(is error) = 1, false = 0)
  group_by(fold) %>%
  summarize(rf = mean(error_rf), tr = mean(error_tr)) %>%
  tidyr::gather(model, error, -fold)
dotplot(error~model, data=error_rates, ylab="Mean Prediction Error")
error_rates %>%
  #make a regression
  lm(error~model, data=.) %>%
  #get data on it
  broom::tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 0.5002020 0.04083358 12.249773 1.832110e-06
## 2 modeltr -0.1248485 0.05774740 -2.161976 6.259637e-02
# create a list of true observed labels
labels <- split(result_df$observed_label, result_df$fold)
# now create a list of predictions for the RF and pass it to the ROCR::prediction function
predictions_rf <- split(result_df$prob_positive_rf, result_df$fold) %>% prediction(labels)
# do the same for the tree
predictions_tr <- split(result_df$prob_positive_tr, result_df$fold) %>% prediction(labels)
# compute average AUC for the RF
mean_auc_rf <- predictions_rf %>%
  performance(measure="auc") %>%
  # I know, this line is ugly, but that's how it is
  slot("y.values") %>% unlist() %>%
  mean()
# compute average AUC for the tree
mean_auc_tr <- predictions_tr %>%
  performance(measure="auc") %>%
  slot("y.values") %>% unlist() %>%
  mean()
# plot the ROC curve for the RF
predictions_rf %>%
  performance(measure="tpr", x.measure="fpr") %>%
  plot(avg="threshold", col="orange", lwd=2)
# plot the ROC curve for the tree
predictions_tr %>%
  performance(measure="tpr", x.measure="fpr") %>%
  plot(avg="threshold", col="blue", lwd=2, add=TRUE)
# add a legend to the plot
legend("bottomright",
       legend=paste(c("rf", "tree"), " AUC:", round(c(mean_auc_rf, mean_auc_tr), digits=3)),
       col=c("orange", "blue"))
When looking at the graph, it seems that the decision tree is consistently better than the random forest. Indeed, looking at the output of the regression, we see that the modeltr coefficient (-.12) is negative. Because the intercept is the average error rate for the random forests, this negative value means that the tree's average error rate is that much lower. The p-value for the modeltr entry is almost significant at the ubiquitous 95% confidence level (0.06), but it is not quite small enough to reasonably reject the null hypothesis that a random forest and a tree have the same true average error.
The areas under the curve are less than stellar. An AUC of .5 means the method in question ranks a random positive above a random negative only half the time: it is just as good as guessing randomly. These values, both around .6, aren't much better than that. Therefore, these classifiers do not seem to be very effective. This is probably because of the really bad training data, as I mentioned above.
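For intuition about what that AUC number means, here is ROCR on four hand-made predictions; three of the four positive/negative pairs are ranked correctly, so the AUC should come out to 0.75:

```r
library(ROCR)
#two positives (up) and two negatives (down), scored by some hypothetical classifier
scores <- c(0.9, 0.8, 0.3, 0.2)
truth  <- c("up", "down", "up", "down")
pred <- prediction(scores, truth, label.ordering = c("down", "up"))
performance(pred, measure = "auc")@y.values[[1]]
```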
Again, this is meant as a framework: once you have it you can plug your better data into it pretty easily.
Back to all that regression stuff…
At the end of the day, we were able to reject the null hypothesis that the slope of a linear regression line between the number of firearm regulations and firearm violence is zero. The fitted slope is slightly negative, which would please the people in the first, pro-regulation camp. However, the data violate enough of the preconditions for regression that this line is not very trustworthy, and we are still in the dark, beholden to the sensationalist media.
Along the way, we learned how to import data from CSV and .xlsx files, tidy the data, represent data on an interactive map, a little machine learning, and some linear regression. I know I said that we would use the data from that table we scraped to add weights to gun laws we found especially important, but I ran out of time for that. We did get to scrape that dataset, though, which is a good skill to know.
Thank you, have a good summer, and tell the Professor to give credit to his sources!